Chapter 6.2: Centrality – Vector Encoding¶
This notebook supplements Chapter 6.2, 'Centrality'.
Import¶
In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from tqdm.notebook import tqdm
from resources_geschichtslyrik import *
from sklearn.preprocessing import OneHotEncoder
from sklearn.manifold import MDS
from itertools import product
from scipy.spatial.distance import cityblock, euclidean, cosine
In [2]:
pd.set_option('display.max_colwidth', None)
In [3]:
meta = pd.read_json(r"../resources/meta.json")
Corpora¶
In [4]:
meta_anth = (
meta
.query("corpus=='anth'")
.query("1850 <= year <= 1918")
.query("geschichtslyrik == 1")
.drop_duplicates(subset='author_title')
.reset_index(drop = True)
)
In [5]:
modcanon_authors = ['Hofmannsthal, Hugo von', 'Rilke, Rainer Maria', 'George, Stefan', 'Heym, Georg']
meta_modcanon = (
meta
.query("author in @modcanon_authors")
.query("1850 <= year <= 1918")
.query("geschichtslyrik == 1")
.drop_duplicates(subset='author_title')
.reset_index(drop = True)
)
In [6]:
muench_authors = ['Münchhausen, Börries von', 'Miegel, Agnes', 'Strauß und Torney, Lulu von']
meta_muench = (
meta
.query("author in @muench_authors")
.query("1850 <= year <= 1918")
.query("geschichtslyrik == 1")
.drop_duplicates(subset='author_title')
.reset_index(drop = True)
)
In [7]:
meta_all = pd.concat([meta_anth, meta_modcanon, meta_muench])
meta_all = meta_all.drop_duplicates(subset = 'id')
meta_all = meta_all.reset_index(drop = True)
meta_all['korpus_anth'] = meta_all['author_title'].isin(meta_anth['author_title'])
meta_all['korpus_modcanon'] = meta_all['author'].isin(modcanon_authors)
meta_all['korpus_muench'] = meta_all['author'].isin(muench_authors)
meta_all.shape[0]
Out[7]:
2063
Ratings¶
In [8]:
stoffgebiet_ratings = get_rating_table(meta = meta_all, mode = 'themes')
entity_ratings = get_rating_table(meta = meta_all, mode = 'entity')
Feature Overview¶
- features_all_df : DataFrame listing all features to be used (name, encoding, weight)
In [9]:
features_data = [
['geschichtslyrik', 'ordinal', 1],
['empirisch', 'bin', 1],
['theoretisch', 'bin', 1],
['gattung', 'nominal_multi', 1],
['sprechinstanz_markiert', 'bin', 1],
['sprechinstanz_in_vergangenheit', 'nominal', 1],
['sprechakte', 'nominal_multi', 1],
['tempus', 'nominal_multi', 1],
['konkretheit', 'ordinal', 1],
['wissen', 'nominal', 1], # nominal rather than ordinal here, e.g. due to ambivalent vs. neutral
['vergangenheitsdominant', 'ordinal', 1],
['zeitebenen', 'interval', 1],
['fixierbarkeit', 'bin', 1],
['beginn', 'interval', 1], # interval is a simplification, cf. NaN
['ende', 'interval', 1], # interval is a simplification, cf. NaN
['anachronismus', 'bin', 1],
['gegenwartsbezug', 'bin', 1],
['grossraum', 'nominal_multi', 1],
['mittelraum', 'nominal_multi', 1],
['kleinraum', 'nominal_multi', 1],
['inhaltstyp', 'nominal_multi', 1],
['stoffgebiet', 'nominal_multi_sim', 1],
['stoffgebiet_bewertung', 'nominal_multi_dependent', 1], # nominal rather than ordinal here, e.g. due to ambivalent vs. neutral
['entity_simple', 'nominal_multi', 1],
['entity_bewertung', 'nominal_multi_dependent', 1], # nominal rather than ordinal here, e.g. due to ambivalent vs. neutral
['nationalismus', 'bin', 1],
['heroismus', 'bin', 1],
['religiositaet', 'bin', 1],
['marker_person', 'bin_multi', 1],
['marker_zeit', 'bin_multi', 1],
['marker_ort', 'bin_multi', 1],
['marker_objekt', 'bin_multi', 1],
['ueberlieferung', 'bin', 1],
['ueberlieferung_bewertung', 'nominal', 1], # nominal rather than ordinal here, e.g. due to ambivalent vs. neutral
['geschichtsauffassung', 'bin', 1],
['geschichtsauffassung_bewertung', 'nominal', 1], # nominal rather than ordinal here, e.g. due to ambivalent vs. neutral
['verhaeltnis_wissen', 'nominal_multi', 1], # nominal rather than ordinal here, e.g. due to natural vs. supernatural
['reim', 'ordinal', 1],
['metrum', 'ordinal', 1],
['verfremdung', 'ordinal', 1],
]
features_all_df = pd.DataFrame(features_data, columns=['feature', 'encoding', 'weight'])
In [10]:
features_all_df.head()
Out[10]:
| feature | encoding | weight | |
|---|---|---|---|
| 0 | geschichtslyrik | ordinal | 1 |
| 1 | empirisch | bin | 1 |
| 2 | theoretisch | bin | 1 |
| 3 | gattung | nominal_multi | 1 |
| 4 | sprechinstanz_markiert | bin | 1 |
In [11]:
features_all_df['encoding'].value_counts()
Out[11]:
encoding
bin                        11
nominal_multi               9
ordinal                     6
nominal                     4
bin_multi                   4
interval                    3
nominal_multi_dependent     2
nominal_multi_sim           1
Name: count, dtype: int64
- bin = binary (e.g. Gegenwartsbezug: yes/no)
- bin_multi = binary with multiple annotations (e.g. person markers: in the title yes/no and in the text yes/no)
- ordinal = ordinal (e.g. rhyme: none, partial, throughout)
- ordinal_multi_dependent = ordinal with multiple annotations, referring to another annotation category
- nominal = nominal (e.g. time level of the speaking instance: unmarked, past, not past)
- nominal_multi = nominal with multiple annotations (e.g. genre: 'Ballade', 'Ballade + Lied', etc.)
- nominal_multi_dependent = nominal with multiple annotations, referring to another annotation category (e.g. evaluation of subject areas: negative, neutral/ambivalent, positive)
- nominal_multi_sim = nominal with multiple annotations, where some values can be closer to each other than others (subject areas: 'Krieg', 'Krieg + Politik', etc.)
- interval = interval-scaled (e.g. number of time levels: 0, 1, 2, 3, 4 ...)
Features Already in the Appropriate Encoding¶
- features_used_df : DataFrame listing all features that are already in the 'right' encoding. Filled step by step.
- features_used : Series with the names of the features that are already in the 'right' encoding
In [12]:
features_proper_encoding = features_all_df.query("encoding == 'bin' | encoding == 'ordinal' | encoding == 'interval'")
features_proper_encoding = features_proper_encoding['feature']
exceptions = [
'beginn', 'ende'
]
features_proper_encoding = [x for x in features_proper_encoding if x not in exceptions]
In [13]:
features_used_df = features_all_df.query("feature in @features_proper_encoding")
features_used_df
Out[13]:
| feature | encoding | weight | |
|---|---|---|---|
| 0 | geschichtslyrik | ordinal | 1 |
| 1 | empirisch | bin | 1 |
| 2 | theoretisch | bin | 1 |
| 4 | sprechinstanz_markiert | bin | 1 |
| 8 | konkretheit | ordinal | 1 |
| 10 | vergangenheitsdominant | ordinal | 1 |
| 11 | zeitebenen | interval | 1 |
| 12 | fixierbarkeit | bin | 1 |
| 15 | anachronismus | bin | 1 |
| 16 | gegenwartsbezug | bin | 1 |
| 25 | nationalismus | bin | 1 |
| 26 | heroismus | bin | 1 |
| 27 | religiositaet | bin | 1 |
| 32 | ueberlieferung | bin | 1 |
| 34 | geschichtsauffassung | bin | 1 |
| 37 | reim | ordinal | 1 |
| 38 | metrum | ordinal | 1 |
| 39 | verfremdung | ordinal | 1 |
In [14]:
features_used = features_used_df['feature']
In [15]:
meta_all[['author', 'title'] + features_used.tolist()].sample(n=5)
Out[15]:
| author | title | geschichtslyrik | empirisch | theoretisch | sprechinstanz_markiert | konkretheit | vergangenheitsdominant | zeitebenen | fixierbarkeit | anachronismus | gegenwartsbezug | nationalismus | heroismus | religiositaet | ueberlieferung | geschichtsauffassung | reim | metrum | verfremdung | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 657 | Gerok, Karl | Heinrichs I. Wahl | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1768 | Hohlbaum, Robert | Luther | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5 |
| 1630 | Müller von Königswinter, Wolfgang | Das Zepter Rudolfs von Habsburg | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 843 | Bergmann, Werner | Bei Rexpoede | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1981 | Münchhausen, Börries von | Le Ralli | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
All Remaining Features¶
In [16]:
def OneHotSimple(data, category_name): # for nominal features
    encoder = OneHotEncoder(sparse_output=False)
    onehot_df = pd.DataFrame(encoder.fit_transform(np.array(data).reshape(-1, 1)))
    # column order follows encoder.categories_, i.e. the sorted unique values
    onehot_df.columns = [category_name + "_" + str(x) for x in encoder.categories_[0]]
    return onehot_df
In [17]:
def OneHotMulti(data, category_name): # for nominal_multi features
    # collect all individual values from the " + "-joined annotation strings
    values = pd.Series([item for entry in data for item in entry.split(" + ")]).unique()
    column_names = [category_name + "_" + value for value in values]
    # initialize with zeros instead of filling NaN afterwards (avoids the
    # object-dtype fillna deprecation warning)
    onehot_df = pd.DataFrame(0, index=range(len(data)), columns=column_names)
    for i, entry in enumerate(data):
        for value in values:
            onehot_df.at[i, category_name + "_" + value] = entry.count(value)
    return onehot_df
gattung [nominal_multi]: One-Hot Encoding¶
In [18]:
meta_all['gattung'] = [str(x) for x in meta_all['gattung']]
data = meta_all['gattung']
In [19]:
onehot_df = OneHotMulti(data, 'gattung')
In [20]:
onehot_df
Out[20]:
| gattung_Ballade | gattung_Sonett | gattung_Lied | gattung_None | gattung_Rollengedicht | gattung_Denkmal-/Ruinenpoesie | |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2058 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2059 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2060 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2061 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2062 | 0 | 0 | 0 | 1 | 0 | 0 |
2063 rows × 6 columns
In [21]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'author', 'title',
'gattung',
'gattung_Ballade',
'gattung_Lied',
'gattung_Denkmal-/Ruinenpoesie',
'gattung_Rollengedicht',
'gattung_Sonett',
'gattung_None'
]].sample(n=5)
Out[21]:
| author | title | gattung | gattung_Ballade | gattung_Lied | gattung_Denkmal-/Ruinenpoesie | gattung_Rollengedicht | gattung_Sonett | gattung_None | |
|---|---|---|---|---|---|---|---|---|---|
| 407 | Sturm, Julius | Frau Elsa | Ballade | 1 | 0 | 0 | 0 | 0 | 0 |
| 1617 | Lissauer, Ernst | Aus dem Dreißigjährigen Kriege | Ballade | 1 | 0 | 0 | 0 | 0 | 0 |
| 1056 | Köppen, Fedor von | Mein deutsches Volk, o denke dran! | None | 0 | 0 | 0 | 0 | 0 | 1 |
| 2034 | Münchhausen, Börries von | Bayard. Der Ritterschlag | Ballade | 1 | 0 | 0 | 0 | 0 | 0 |
| 873 | Hosäus, Wilhelm | Albrecht der Bär | Ballade | 1 | 0 | 0 | 0 | 0 | 0 |
In [22]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat([
features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
In [23]:
features_used_df
Out[23]:
| feature | encoding | weight | encoding_orig | |
|---|---|---|---|---|
| 0 | geschichtslyrik | ordinal | 1.000000 | NaN |
| 1 | empirisch | bin | 1.000000 | NaN |
| 2 | theoretisch | bin | 1.000000 | NaN |
| 3 | sprechinstanz_markiert | bin | 1.000000 | NaN |
| 4 | konkretheit | ordinal | 1.000000 | NaN |
| 5 | vergangenheitsdominant | ordinal | 1.000000 | NaN |
| 6 | zeitebenen | interval | 1.000000 | NaN |
| 7 | fixierbarkeit | bin | 1.000000 | NaN |
| 8 | anachronismus | bin | 1.000000 | NaN |
| 9 | gegenwartsbezug | bin | 1.000000 | NaN |
| 10 | nationalismus | bin | 1.000000 | NaN |
| 11 | heroismus | bin | 1.000000 | NaN |
| 12 | religiositaet | bin | 1.000000 | NaN |
| 13 | ueberlieferung | bin | 1.000000 | NaN |
| 14 | geschichtsauffassung | bin | 1.000000 | NaN |
| 15 | reim | ordinal | 1.000000 | NaN |
| 16 | metrum | ordinal | 1.000000 | NaN |
| 17 | verfremdung | ordinal | 1.000000 | NaN |
| 18 | gattung_Ballade | bin | 0.166667 | nominal_multi |
| 19 | gattung_Sonett | bin | 0.166667 | nominal_multi |
| 20 | gattung_Lied | bin | 0.166667 | nominal_multi |
| 21 | gattung_None | bin | 0.166667 | nominal_multi |
| 22 | gattung_Rollengedicht | bin | 0.166667 | nominal_multi |
| 23 | gattung_Denkmal-/Ruinenpoesie | bin | 0.166667 | nominal_multi |
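The `1/len(column_names)` weights ensure that an original feature expanded into n one-hot columns still contributes a total weight of 1, no more than an unexpanded binary feature. A small sanity check on a hypothetical expansion (structure mirroring features_used_df):

```python
import pandas as pd

# Hypothetical slice of an expanded feature table (illustration only)
df = pd.DataFrame({
    "feature": ["reim", "gattung_Ballade", "gattung_Lied", "gattung_Sonett"],
    "encoding_orig": [None, "nominal_multi", "nominal_multi", "nominal_multi"],
    "weight": [1.0, 1/3, 1/3, 1/3],
})

# The expanded columns of one original feature sum to 1
expansion_total = df.loc[df["encoding_orig"] == "nominal_multi", "weight"].sum()
print(expansion_total)  # 1.0 (up to floating-point rounding)
```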
sprechinstanz_in_vergangenheit [nominal]: One-Hot Encoding¶
In [24]:
data = meta_all['sprechinstanz_in_vergangenheit']
data = data.replace({float('NaN') : 'unmarkiert', 0 : 'gegenwart', 1 : 'vergangenheit'})
In [25]:
onehot_df = OneHotSimple(data, 'sprechinstanz_in_vergangenheit')
In [26]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'sprechinstanz_in_vergangenheit',
'sprechinstanz_in_vergangenheit_unmarkiert',
'sprechinstanz_in_vergangenheit_vergangenheit',
'sprechinstanz_in_vergangenheit_gegenwart'
]].sample(n=10)
Out[26]:
| sprechinstanz_in_vergangenheit | sprechinstanz_in_vergangenheit_unmarkiert | sprechinstanz_in_vergangenheit_vergangenheit | sprechinstanz_in_vergangenheit_gegenwart | |
|---|---|---|---|---|
| 458 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1759 | NaN | 1.0 | 0.0 | 0.0 |
| 1289 | NaN | 1.0 | 0.0 | 0.0 |
| 2007 | 1.0 | 0.0 | 1.0 | 0.0 |
| 682 | NaN | 1.0 | 0.0 | 0.0 |
| 378 | NaN | 1.0 | 0.0 | 0.0 |
| 1967 | NaN | 1.0 | 0.0 | 0.0 |
| 1517 | NaN | 1.0 | 0.0 | 0.0 |
| 1325 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1504 | NaN | 1.0 | 0.0 | 0.0 |
In [27]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
sprechakte [nominal_multi]: One-Hot Encoding¶
In [28]:
meta_all['sprechakte'] = [str(x) for x in meta_all['sprechakte']]
data = meta_all['sprechakte']
In [29]:
onehot_df = OneHotMulti(data, 'sprechakte')
In [30]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'sprechakte',
'sprechakte_Erzählen',
'sprechakte_Beschreiben',
'sprechakte_Auffordern',
]].sample(n=10)
Out[30]:
| sprechakte | sprechakte_Erzählen | sprechakte_Beschreiben | sprechakte_Auffordern | |
|---|---|---|---|---|
| 41 | Erzählen | 1 | 0 | 0 |
| 1056 | Auffordern + Beschreiben | 0 | 1 | 1 |
| 299 | Erzählen | 1 | 0 | 0 |
| 148 | Auffordern + Beschreiben | 0 | 1 | 1 |
| 495 | Erzählen | 1 | 0 | 0 |
| 332 | Erzählen | 1 | 0 | 0 |
| 23 | Erzählen | 1 | 0 | 0 |
| 1239 | Erzählen | 1 | 0 | 0 |
| 644 | Erzählen | 1 | 0 | 0 |
| 1026 | Auffordern + Beschreiben + Erzählen | 1 | 1 | 1 |
In [31]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
tempus [nominal_multi]: One-Hot Encoding¶
In [32]:
meta_all['tempus'] = [str(x) for x in meta_all['tempus']]
data = meta_all['tempus']
In [33]:
onehot_df = OneHotMulti(data, 'tempus')
In [34]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'tempus',
'tempus_Präsens',
'tempus_Präteritum',
'tempus_Futur',
]].sample(n=5)
Out[34]:
| tempus | tempus_Präsens | tempus_Präteritum | tempus_Futur | |
|---|---|---|---|---|
| 22 | Präsens + Präteritum | 1 | 1 | 0 |
| 1066 | Präsens + Präteritum | 1 | 1 | 0 |
| 257 | Präsens + Präteritum | 1 | 1 | 0 |
| 1250 | Präsens | 1 | 0 | 0 |
| 360 | Präsens | 1 | 0 | 0 |
In [35]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
wissen [nominal]: One-Hot Encoding¶
In [36]:
# Nominal approach
data = meta_all['wissen']
data = data.replace({float('NaN') : 'neutral', 0 : 'ambivalent', 1 : 'wissend', -1 : 'unwissend'})
onehot_df = OneHotSimple(data, 'wissen')
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'wissen',
'wissen_neutral',
'wissen_wissend',
'wissen_unwissend',
'wissen_ambivalent',
]].sample(n=10)
Out[36]:
| wissen | wissen_neutral | wissen_wissend | wissen_unwissend | wissen_ambivalent | |
|---|---|---|---|---|---|
| 1041 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 73 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1261 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 837 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1707 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1209 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 132 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1371 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 1645 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
| 605 | NaN | 1.0 | 0.0 | 0.0 | 0.0 |
In [37]:
# Ordinal approach (not used)
# meta_all['wissen'] = meta_all['wissen'].replace({float('NaN') : 0})
In [38]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
beginn [interval]: Standardize (replace NaN with the median)¶
In [39]:
meta_all.query("beginn.isna()").shape[0]
Out[39]:
9
In [40]:
beginn_median = meta_all.query("korpus_anth and beginn.notna()")['beginn'].median()
beginn_median
Out[40]:
1521.0
In [41]:
meta_all['beginn'] = meta_all['beginn'].fillna(beginn_median)
In [42]:
meta_all.query("beginn.isna()").shape[0]
Out[42]:
0
In [43]:
features_used_add = pd.DataFrame({
'feature' : ['beginn'],
'encoding_orig' : ['interval'],
'encoding' : ['interval'],
'weight' : [1]
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
ende [interval]: Standardize (replace NaN with the median)¶
In [44]:
meta_all.query("ende.isna()").shape[0]
Out[44]:
9
In [45]:
ende_median = meta_all.query("korpus_anth and ende.notna()")['ende'].median()
ende_median
Out[45]:
1523.0
In [46]:
meta_all['ende'] = meta_all['ende'].fillna(ende_median)
In [47]:
meta_all.query("ende.isna()").shape[0]
Out[47]:
0
In [48]:
features_used_add = pd.DataFrame({
'feature' : ['ende'],
'encoding_orig' : ['interval'],
'encoding' : ['interval'],
'weight' : [1]
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
grossraum [nominal_multi]: One-Hot Encoding¶
In [49]:
meta_all['grossraum'] = [str(x) for x in meta_all['grossraum']]
data = meta_all['grossraum']
In [50]:
onehot_df = OneHotMulti(data, 'grossraum')
In [51]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'grossraum',
'grossraum_Europa',
'grossraum_Asien',
'grossraum_Afrika',
]].sample(n=10)
Out[51]:
| grossraum | grossraum_Europa | grossraum_Asien | grossraum_Afrika | |
|---|---|---|---|---|
| 1525 | Europa | 1 | 0 | 0 |
| 480 | Europa | 1 | 0 | 0 |
| 1070 | Europa | 1 | 0 | 0 |
| 728 | Europa | 1 | 0 | 0 |
| 280 | Europa | 1 | 0 | 0 |
| 278 | Europa | 1 | 0 | 0 |
| 844 | Europa | 1 | 0 | 0 |
| 1515 | Europa | 1 | 0 | 0 |
| 465 | Europa | 1 | 0 | 0 |
| 721 | Europa | 1 | 0 | 0 |
In [52]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
mittelraum [nominal_multi]: One-Hot Encoding¶
In [53]:
meta_all['mittelraum'] = [str(x) for x in meta_all['mittelraum']]
In [54]:
# mittelraum_deutsch = ["Heiliges Römisches Reich", "Proto-Deutschland", "Deutsches Kaiserreich", "Fränkisches Reich", "Germanien",
# "Ostfränkisches Reich", "Deutschland", "Deutschordensstaat", "Deutscher Bund"]
#
# for raum in mittelraum_deutsch:
# meta_all['mittelraum'] = [re.sub(raum, "Deutscher Raum", x) for x in meta_all['mittelraum']]
In [55]:
data = meta_all['mittelraum']
In [56]:
onehot_df = OneHotMulti(data, 'mittelraum')
In [57]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'mittelraum',
'mittelraum_Heiliges Römisches Reich',
'mittelraum_Frankreich',
'mittelraum_None'
]].sample(n=10)
Out[57]:
| mittelraum | mittelraum_Heiliges Römisches Reich | mittelraum_Frankreich | mittelraum_None | |
|---|---|---|---|---|
| 902 | Deutscher Raum | 0 | 0 | 0 |
| 844 | Deutscher Raum | 0 | 0 | 0 |
| 720 | Byzantinisches Reich | 0 | 0 | 0 |
| 855 | Deutscher Raum | 0 | 0 | 0 |
| 145 | Deutscher Raum | 0 | 0 | 0 |
| 243 | Großbritannien | 0 | 0 | 0 |
| 1501 | Kaisertum Österreich | 0 | 0 | 0 |
| 159 | Heiliges Römisches Reich | 1 | 0 | 0 |
| 1786 | Kaisertum Österreich | 0 | 0 | 0 |
| 1678 | Heiliges Römisches Reich | 1 | 0 | 0 |
In [58]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
kleinraum [nominal_multi]: One-Hot Encoding¶
In [59]:
meta_all['kleinraum'] = [str(x) for x in meta_all['kleinraum']]
data = meta_all['kleinraum']
In [60]:
onehot_df = OneHotMulti(data, 'kleinraum')
In [61]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'kleinraum',
'kleinraum_Paris',
'kleinraum_Berlin',
'kleinraum_Wien',
'kleinraum_None',
]].sample(n=10)
Out[61]:
| kleinraum | kleinraum_Paris | kleinraum_Berlin | kleinraum_Wien | kleinraum_None | |
|---|---|---|---|---|---|
| 359 | None | 0 | 0 | 0 | 1 |
| 513 | Wien | 0 | 0 | 1 | 0 |
| 1634 | Augsburg | 0 | 0 | 0 | 0 |
| 1968 | Paris | 1 | 0 | 0 | 0 |
| 270 | Bouvines | 0 | 0 | 0 | 0 |
| 761 | None | 0 | 0 | 0 | 1 |
| 665 | Mons Lactarius | 0 | 0 | 0 | 0 |
| 757 | Eresburg | 0 | 0 | 0 | 0 |
| 1954 | None | 0 | 0 | 0 | 1 |
| 845 | Kosel | 0 | 0 | 0 | 0 |
In [62]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
inhaltstyp [nominal_multi]: One-Hot Encoding¶
In [63]:
meta_all['inhaltstyp'] = [str(x) for x in meta_all['inhaltstyp']]
data = meta_all['inhaltstyp']
In [64]:
onehot_df = OneHotMulti(data, 'inhaltstyp')
In [65]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'inhaltstyp',
'inhaltstyp_Ereignis',
'inhaltstyp_Zustand',
]].sample(n=10)
Out[65]:
| inhaltstyp | inhaltstyp_Ereignis | inhaltstyp_Zustand | |
|---|---|---|---|
| 1312 | Zustand | 0 | 1 |
| 223 | Ereignis | 1 | 0 |
| 1585 | Ereignis + Zustand | 1 | 1 |
| 696 | Zustand | 0 | 1 |
| 1981 | Ereignis | 1 | 0 |
| 1003 | Zustand | 0 | 1 |
| 71 | Ereignis | 1 | 0 |
| 1740 | Ereignis | 1 | 0 |
| 1342 | Ereignis + Zustand | 1 | 1 |
| 1976 | Zustand | 0 | 1 |
In [66]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
stoffgebiete [nominal_multi_sim]¶
Look up GermaNet synsets for the individual subject areas¶
In [67]:
from germanetpy.germanet import Germanet
data_path = "../resources/more/GN_V160_XML"
germanet = Germanet(data_path)
In [68]:
stoffgebiete = stoffgebiet_ratings['type'].unique().tolist()
In [69]:
germanet_synsets = {}
germanet_synsets['Militär/Krieg'] = germanet.get_synsets_by_orthform("Krieg")[0]
germanet_synsets['Politik'] = germanet.get_synsets_by_orthform("Politik")[1]
germanet_synsets['Literatur'] = germanet.get_synsets_by_orthform("Literatur")[1]
germanet_synsets['Architektur'] = germanet.get_synsets_by_orthform("Architektur")[1]
germanet_synsets['Nation/Volk-iD'] = germanet.get_synsets_by_orthform("Region")[2]
germanet_synsets['Aufstand/Revolution'] = germanet.get_synsets_by_orthform("Aufstand")[0]
germanet_synsets['Erfindung/Innovation'] = germanet.get_synsets_by_orthform("Erfindung")[0]
germanet_synsets['Religion'] = germanet.get_synsets_by_orthform("Religion")[1]
germanet_synsets['Herrscherliches Handeln'] = germanet.get_synsets_by_orthform("herrschen")[1]
germanet_synsets['Arbeit'] = germanet.get_synsets_by_orthform("Arbeit")[2]
germanet_synsets['Ertrinken'] = germanet.get_synsets_by_orthform("ertrinken")[0]
germanet_synsets['Essen/Trinken'] = germanet.get_synsets_by_orthform("Nahrungsmittel")[0]
germanet_synsets['Jagd'] = germanet.get_synsets_by_orthform("Jagd")[1]
germanet_synsets['Musik'] = germanet.get_synsets_by_orthform("Musik")[2]
germanet_synsets['Recht'] = germanet.get_synsets_by_orthform("Recht")[2]
germanet_synsets['Nation/Volk-D'] = germanet.get_synsets_by_orthform("Deutschland")[0]
germanet_synsets['Auferstehung/Geister'] = germanet.get_synsets_by_orthform("Auferstehung")[0]
germanet_synsets['Denkmal'] = germanet.get_synsets_by_orthform("Denkmal")[1]
germanet_synsets['Sport'] = germanet.get_synsets_by_orthform("Sport")[1]
germanet_synsets['Waffen'] = germanet.get_synsets_by_orthform("Waffe")[0]
germanet_synsets['Ankunft'] = germanet.get_synsets_by_orthform("Ankunft")[1]
germanet_synsets['Natur'] = germanet.get_synsets_by_orthform("Natur")[3]
germanet_synsets['Nation/Volk-nD'] = germanet.get_synsets_by_orthform("Nation")[0]
germanet_synsets['Identitätsenthüllung'] = germanet.get_synsets_by_orthform("Identität")[1]
germanet_synsets['Kampf'] = germanet.get_synsets_by_orthform("Kampf")[3]
germanet_synsets['Eltern-Kind-Beziehung'] = germanet.get_synsets_by_orthform("Kindererziehung")[0]
germanet_synsets['Astronomie/Astrologie'] = germanet.get_synsets_by_orthform("Astronomie")[0]
germanet_synsets['Italiensehnsucht'] = germanet.get_synsets_by_orthform("Italien")[0]
germanet_synsets['Geburtstag'] = germanet.get_synsets_by_orthform("Geburtstag")[1]
germanet_synsets['Landwirtschaft'] = germanet.get_synsets_by_orthform("Landwirtschaft")[1]
germanet_synsets['Schlaf/Traum'] = germanet.get_synsets_by_orthform("Schlaf")[0]
germanet_synsets['Abschied'] = germanet.get_synsets_by_orthform("Abschied")[1]
germanet_synsets['Sprache'] = germanet.get_synsets_by_orthform("Sprache")[3]
germanet_synsets['Kindheit/Jugend'] = germanet.get_synsets_by_orthform("Kindheit")[0]
for stoffgebiet in stoffgebiete:
    if stoffgebiet not in germanet_synsets:
        synsets = germanet.get_synsets_by_orthform(stoffgebiet)
        if len(synsets) > 0:
            germanet_synsets[stoffgebiet] = synsets[0]
        else:
            germanet_synsets[stoffgebiet] = []
Build a distance matrix between texts based on the GermaNet distances between subject-area names (stoffgebiet_dist)¶
In [70]:
meta_all['stoffgebiet'] = [str(x) for x in meta_all['stoffgebiet']]
stoffgebiete_all = meta_all['stoffgebiet']
stoffgebiete_all = [x.split(" + ") for x in stoffgebiete_all]
In [71]:
def get_multi_dist(stoffgebiete_a, stoffgebiete_b):
    if stoffgebiete_a == stoffgebiete_b:
        return 0
    # mean GermaNet distance over all cross-pairs of subject areas
    distances = [get_single_dist(a, b) for a, b in product(stoffgebiete_a, stoffgebiete_b)]
    if all(np.isnan(x) for x in distances):
        return float('NaN')
    return np.nanmean(distances)
def get_single_dist(stoffgebiet_a, stoffgebiet_b):
    if stoffgebiet_a == stoffgebiet_b:
        return 0
    synset_a = germanet_synsets[stoffgebiet_a]
    synset_b = germanet_synsets[stoffgebiet_b]
    try:
        return synset_a.shortest_path_distance(synset_b)
    except (AttributeError, TypeError): # no synset available for one of the subject areas
        return float('NaN')
In [72]:
get_single_dist('Militär/Krieg', 'Liebe')
Out[72]:
10
In [73]:
get_multi_dist(['Militär/Krieg'], ['Liebe'])
Out[73]:
10.0
In [74]:
get_multi_dist(['Militär/Krieg', 'Religion'], ['Liebe'])
Out[74]:
12.5
In [75]:
get_multi_dist(['Militär/Krieg', 'Religion'], ['Liebe', 'Religion', 'Militär/Krieg'])
Out[75]:
8.5
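The averaging behavior can be reproduced with a stand-in distance table. The pairwise values below are back-solved from the outputs above (only 'Militär/Krieg'–'Liebe' = 10 is directly confirmed; the other two are inferred), not taken from GermaNet itself:

```python
from itertools import product
from statistics import mean

# Hypothetical single-topic distances standing in for GermaNet path lengths;
# back-solved from the notebook outputs above.
single_dist = {
    ('Militär/Krieg', 'Liebe'): 10,
    ('Religion', 'Liebe'): 15,
    ('Militär/Krieg', 'Religion'): 13,
}

def lookup(a, b):
    # Identical topics have distance 0; otherwise use the symmetric table.
    if a == b:
        return 0
    return single_dist.get((a, b), single_dist.get((b, a)))

def multi_dist(topics_a, topics_b):
    # Mean distance over the cross product of the two topic lists.
    if topics_a == topics_b:
        return 0
    return mean(lookup(a, b) for a, b in product(topics_a, topics_b))

print(multi_dist(['Militär/Krieg', 'Religion'], ['Liebe']))  # 12.5, as in Out[74]
```

This reproduces Out[75] as well: the six pairs (10, 13, 0, 15, 0, 13) average to 8.5.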
In [76]:
stoffgebiete_dist = np.empty((len(stoffgebiete_all), len(stoffgebiete_all)))
stoffgebiete_dists = {}
for i, stoffgebiete_a in tqdm(enumerate(stoffgebiete_all), total = len(stoffgebiete_all)):
    for j, stoffgebiete_b in enumerate(stoffgebiete_all):
        # create strings as ids
        stoffgebiete_ab_str = str(stoffgebiete_a)+"–"+str(stoffgebiete_b)
        stoffgebiete_ba_str = str(stoffgebiete_b)+"–"+str(stoffgebiete_a)
        # lookup
        if stoffgebiete_ab_str in stoffgebiete_dists:
            distance = stoffgebiete_dists[stoffgebiete_ab_str]
        elif stoffgebiete_ba_str in stoffgebiete_dists:
            distance = stoffgebiete_dists[stoffgebiete_ba_str]
        # get distance
        else:
            distance = get_multi_dist(stoffgebiete_a, stoffgebiete_b)
        stoffgebiete_dist[i, j] = distance
        stoffgebiete_dists[stoffgebiete_ab_str] = distance
stoffgebiete_dist = pd.DataFrame(stoffgebiete_dist)
stoffgebiete_dist = stoffgebiete_dist.fillna(stoffgebiete_dist.mean().mean())
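Since the distance is symmetric, the loop above looks each pair up under both orderings before recomputing. The caching pattern in isolation, with a dummy compute function standing in for get_multi_dist:

```python
# Symmetric memoization: a pair's distance is stored once and found under
# either ordering, so each unordered pair is computed only a single time.
cache = {}
calls = []

def compute(a, b):
    calls.append((a, b))         # record actual computations
    return abs(len(a) - len(b))  # dummy distance for the sketch

def cached_dist(a, b):
    key_ab = str(a) + "–" + str(b)
    key_ba = str(b) + "–" + str(a)
    if key_ab in cache:
        return cache[key_ab]
    if key_ba in cache:
        return cache[key_ba]
    cache[key_ab] = compute(a, b)
    return cache[key_ab]

cached_dist(['Militär/Krieg'], ['Liebe', 'Religion'])
cached_dist(['Liebe', 'Religion'], ['Militär/Krieg'])  # served from the cache
print(len(calls))  # 1: the symmetric pair is computed only once
```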
Test: examples of specific texts and their computed distances¶
In [77]:
sample_index = meta_all.sample(n=10).index
meta_all[['author', 'title', 'stoffgebiet']].loc[sample_index]
Out[77]:
| author | title | stoffgebiet | |
|---|---|---|---|
| 21 | Sturm, Julius | Ein Kunststück | Militär/Krieg |
| 1974 | Miegel, Agnes | Jane | Gefangenschaft + Liebe |
| 204 | Rocholl, R. | König Joram’s Abgötterei | Religion |
| 109 | Lingg, Hermann | Attilas Schwert | Politik |
| 570 | Mautner, Eduard | Admiral Tegetthoff | Militär/Krieg + Tod |
| 1918 | George, Stefan | Vom Ritter der sich verliegt | Rittertum |
| 1003 | Dahn, Felix | Lied Walthers von der Vogelweide | Literatur |
| 1270 | Hamerling, Robert | Der Brand Roms | Brand |
| 1521 | Lingg, Hermann | Der Kinder Kreuzfahrt | Religion |
| 640 | Gruppe, Otto Friedrich | Der Bauer und der Mohr | Essen/Trinken |
In [78]:
stoffgebiete_dist.loc[sample_index,sample_index]
Out[78]:
| 21 | 1974 | 204 | 109 | 570 | 1918 | 1003 | 1270 | 1521 | 640 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 0.0 | 9.5 | 13.0 | 2.0 | 4.0 | 17.0 | 13.0 | 8.0 | 13.0 | 10.0 |
| 1974 | 9.5 | 0.0 | 12.5 | 9.5 | 9.5 | 16.5 | 11.5 | 9.5 | 12.5 | 9.5 |
| 204 | 13.0 | 12.5 | 0.0 | 13.0 | 13.0 | 16.0 | 14.0 | 13.0 | 0.0 | 9.0 |
| 109 | 2.0 | 9.5 | 13.0 | 0.0 | 5.0 | 17.0 | 13.0 | 8.0 | 13.0 | 10.0 |
| 570 | 4.0 | 9.5 | 13.0 | 5.0 | 0.0 | 17.0 | 13.0 | 6.0 | 13.0 | 10.0 |
| 1918 | 17.0 | 16.5 | 16.0 | 17.0 | 17.0 | 0.0 | 18.0 | 17.0 | 16.0 | 13.0 |
| 1003 | 13.0 | 11.5 | 14.0 | 13.0 | 13.0 | 18.0 | 0.0 | 13.0 | 14.0 | 11.0 |
| 1270 | 8.0 | 9.5 | 13.0 | 8.0 | 6.0 | 17.0 | 13.0 | 0.0 | 13.0 | 10.0 |
| 1521 | 13.0 | 12.5 | 0.0 | 13.0 | 13.0 | 16.0 | 14.0 | 13.0 | 0.0 | 9.0 |
| 640 | 10.0 | 9.5 | 9.0 | 10.0 | 10.0 | 13.0 | 11.0 | 10.0 | 9.0 | 0.0 |
Test: on average, how far are a text's Stoffgebiete from all other Stoffgebiete?¶
In [79]:
meta_all['dist_stoffgebiet_mean'] = stoffgebiete_dist.mean(axis = 1)
meta_all[[
'stoffgebiet' ,'dist_stoffgebiet_mean'
]].sort_values(by = 'dist_stoffgebiet_mean').drop_duplicates('stoffgebiet')
Out[79]:
| stoffgebiet | dist_stoffgebiet_mean | |
|---|---|---|
| 917 | Militär/Krieg + Treue/Gefolgschaft | 5.810355 |
| 89 | Militär/Krieg | 5.861842 |
| 118 | Militär/Krieg + Politik | 6.083889 |
| 429 | Militär/Krieg + Verbrechen | 6.163345 |
| 1148 | Sport + Militär/Krieg | 6.171585 |
| ... | ... | ... |
| 951 | Nation/Volk-nD | 16.253989 |
| 1303 | Stadt | 16.338656 |
| 2060 | Adel | 16.405791 |
| 1854 | Rittertum | 16.647833 |
| 351 | Genie | 16.692994 |
359 rows × 2 columns
Texts with Stoffgebiet X (e.g. 'Militär/Krieg') are, with respect to Stoffgebiete, on average [dist_stoffgebiet_mean] away from all other texts.
Transform the distance matrix via MDS into 20 dimensions and add these (stoffgebiete_dim_1, stoffgebiete_dim_2, etc.) as features¶
In [80]:
# from sklearn.manifold import MDS
# from scipy.spatial import distance
#
# n_components = 20
# random_states = range(17, 40)
#
# for random_state in random_states:
#     model = MDS(n_components = n_components, random_state = random_state, dissimilarity = 'precomputed')
#     column_names = ['stoffgebiete_dim_' + str(i+1) for i in range(n_components)]
#     meta_all[column_names] = model.fit_transform(stoffgebiete_dist)
#     stoffgebiete_centroid = meta_all[column_names].mean()
#     for i, element in enumerate(meta_all.iloc):
#         meta_all.at[i, 'dist_stoffgebiet_centroid_cosine'] = distance.cosine(element[column_names], stoffgebiete_centroid)
#     print(f"random state : {random_state}")
#     print(f"corr cosine : {meta_all[['dist_stoffgebiet_mean', 'dist_stoffgebiet_centroid_cosine']].corr().iloc[0,1]}")
#     print(f"\n")
In [81]:
n_components = 20
model = MDS(n_components = n_components, random_state = 24, dissimilarity = 'precomputed')
In [82]:
column_names = ['stoffgebiete_dim_' + str(i+1) for i in range(n_components)]
In [83]:
meta_all[column_names] = model.fit_transform(stoffgebiete_dist)
Test: model stress¶
In [84]:
stress = model.stress_
stress1 = np.sqrt(stress / (0.5 * np.sum(stoffgebiete_dist.values**2)))
print(f"sklearn stress : {stress}")
print(f"Kruskal's stress : {stress1}")
sklearn stress   : 238485.3679399986
Kruskal's stress : 0.034216049330838744
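As a sanity check of the stress-1 normalization used above: a configuration that reproduces its target distances exactly must have stress 0. A toy check with numpy only (no MDS fit involved; the 3-point line configuration is illustrative):

```python
import numpy as np

# stress1 = sqrt(raw_stress / (0.5 * sum(d_ij^2))), where raw_stress is the
# sum of squared differences between target and embedded distances
# (sklearn counts each pair once, hence the divisions by 2).
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])  # target distances (three points on a line)

X = np.array([[0.0], [1.0], [2.0]])  # a 1-D configuration realizing D exactly
D_hat = np.abs(X - X.T)              # pairwise distances of the configuration

raw_stress = ((D - D_hat) ** 2).sum() / 2
stress1 = np.sqrt(raw_stress / (0.5 * (D ** 2).sum()))
print(stress1)  # 0.0: a perfect embedding has zero stress
```

By the usual rule of thumb, the 0.034 obtained above indicates a good fit of the 20-dimensional configuration.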
Test: do the first two dimensions make topical differences (here: Recht vs. Other) visible?¶
In [85]:
meta_all['concrete_stoffgebiet'] = ['Recht' if x == 'Recht' else 'Other' for x in meta_all['stoffgebiet']]
px.scatter(
meta_all,
x = 'stoffgebiete_dim_1',
y = 'stoffgebiete_dim_2',
color = 'concrete_stoffgebiet',
hover_data = ['stoffgebiet']
)
Test: how does centrality according to the distance matrix (stoffgebiete_dist) relate to centrality according to the dimensional model? Is a text with a very low 'dist_stoffgebiet_mean' value also close to the centroid of the dimensional model?¶
In [86]:
stoffgebiete_centroid = meta_all[column_names].mean()
for i, element in enumerate(meta_all.iloc):
    meta_all.at[i, 'dist_stoffgebiet_centroid_manhattan'] = cityblock(element[column_names], stoffgebiete_centroid)
    meta_all.at[i, 'dist_stoffgebiet_centroid_euclidean'] = euclidean(element[column_names], stoffgebiete_centroid)
    meta_all.at[i, 'dist_stoffgebiet_centroid_cosine'] = cosine(element[column_names], stoffgebiete_centroid)
In [87]:
# dist_stoffgebiet_centroid_cosine depends strongly on random_state and is practically random (it can even be negative)
meta_all[[
'dist_stoffgebiet_mean',
'dist_stoffgebiet_centroid_manhattan', 'dist_stoffgebiet_centroid_euclidean', 'dist_stoffgebiet_centroid_cosine'
]].corr()
Out[87]:
| dist_stoffgebiet_mean | dist_stoffgebiet_centroid_manhattan | dist_stoffgebiet_centroid_euclidean | dist_stoffgebiet_centroid_cosine | |
|---|---|---|---|---|
| dist_stoffgebiet_mean | 1.000000 | 0.987770 | 0.993300 | -0.824851 |
| dist_stoffgebiet_centroid_manhattan | 0.987770 | 1.000000 | 0.995385 | -0.799814 |
| dist_stoffgebiet_centroid_euclidean | 0.993300 | 0.995385 | 1.000000 | -0.799277 |
| dist_stoffgebiet_centroid_cosine | -0.824851 | -0.799814 | -0.799277 | 1.000000 |
Transfer¶
In [88]:
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi_sim'] * len(column_names),
'encoding' : ['interval'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
stoffgebiet_bewertung [nominal_multi_dependent]¶
In [89]:
stoffgebiet_ratings['rating'] = stoffgebiet_ratings['rating'].astype(int)
In [90]:
all_types = stoffgebiet_ratings['type'].sort_values().unique()
all_ratings = stoffgebiet_ratings['rating'].sort_values().unique()
In [91]:
# Nominal approach
column_names = []
for this_type in all_types:
    for this_rating in all_ratings:
        column_names.append('stoffgebiet_bewertung_' + this_type + '_' + str(this_rating))
onehot_df = pd.DataFrame(columns = column_names)
for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
    this_title = element.title
    this_author = element.author
    this_ratings = stoffgebiet_ratings.query("author == @this_author and title == @this_title")
    for rating in this_ratings.iloc:
        this_type = rating.type
        this_rating = rating.rating
        onehot_df.at[i, 'stoffgebiet_bewertung_' + this_type + '_' + str(this_rating)] = 1
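In this nominal scheme, each annotated (Stoffgebiet, rating) pair sets exactly one indicator column to 1. A minimal stand-alone sketch of that mapping (the types, ratings, and example text are illustrative, not taken from the corpus):

```python
# One indicator column per (Stoffgebiet, rating) pair; hypothetical values.
types = ['Militär/Krieg', 'Politik']
ratings = [1, 2, 3]
columns = [f'stoffgebiet_bewertung_{t}_{r}' for t in types for r in ratings]

def encode(text_ratings):
    # text_ratings: list of (type, rating) pairs annotated for one text
    row = dict.fromkeys(columns, 0)
    for t, r in text_ratings:
        row[f'stoffgebiet_bewertung_{t}_{r}'] = 1
    return row

# e.g. a text annotated 'Militär/Krieg + Politik' with ratings '3 + 1'
row = encode([('Militär/Krieg', 3), ('Politik', 1)])
print(row['stoffgebiet_bewertung_Militär/Krieg_3'],
      row['stoffgebiet_bewertung_Politik_1'])  # 1 1
```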
In [92]:
# Ordinal approach
# stoffgebiet_ratings['rating'] = stoffgebiet_ratings['rating'].replace({3 : 0, 2 : -1})
#
# column_names = []
# for this_type in all_types:
#     column_names.append('stoffgebiet_bewertung_' + this_type)
#
# onehot_df = pd.DataFrame(columns = column_names)
#
# for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
#     this_title = element.title
#     this_author = element.author
#     this_ratings = stoffgebiet_ratings.query("author == @this_author and title == @this_title")
#
#     for rating in this_ratings.iloc:
#         this_type = rating.type
#         this_rating = rating.rating
#         onehot_df.at[i, 'stoffgebiet_bewertung_' + this_type] = this_rating
In [93]:
onehot_df = onehot_df.fillna(0)
/var/folders/45/zsyytpq97xq280z_cvw88j240000gn/T/ipykernel_2519/3506148593.py:1: FutureWarning:
Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
In [94]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'stoffgebiet',
'stoffgebiet_bewertung',
'stoffgebiet_bewertung_Militär/Krieg_1',
'stoffgebiet_bewertung_Militär/Krieg_2',
'stoffgebiet_bewertung_Politik_1',
]].sample(n=10)
Out[94]:
| stoffgebiet | stoffgebiet_bewertung | stoffgebiet_bewertung_Militär/Krieg_1 | stoffgebiet_bewertung_Militär/Krieg_2 | stoffgebiet_bewertung_Politik_1 | |
|---|---|---|---|---|---|
| 576 | Liebe + Politik | 1 + 1 | 0 | 0 | 1 |
| 1644 | Militär/Krieg | 1 | 1 | 0 | 0 |
| 442 | Tod + Politik | 1 + 2 | 0 | 0 | 0 |
| 1508 | Militär/Krieg | 3 | 0 | 0 | 0 |
| 153 | Politik | 1 | 0 | 0 | 1 |
| 166 | Militär/Krieg + Politik | 3 + 1 | 0 | 0 | 1 |
| 1060 | Denkmal | 1 | 0 | 0 | 0 |
| 1356 | Jagd + Herrscherliches Handeln | 0 + 3 | 0 | 0 | 0 |
| 767 | Tod + Politik | 1 + 0 | 0 | 0 | 0 |
| 821 | Friede + Politik | 2 + 1 | 0 | 0 | 1 |
In [95]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi_dependent'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
entity_simple [nominal_multi]¶
In [96]:
meta_all['entity_simple'] = [str(x) for x in meta_all['entity_simple']]
data = meta_all['entity_simple']
In [97]:
onehot_df = OneHotMulti(data, 'entity_simple')
In [98]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'entity_simple',
'entity_simple_1',
'entity_simple_2',
'entity_simple_3',
]].sample(n=10)
Out[98]:
| entity_simple | entity_simple_1 | entity_simple_2 | entity_simple_3 | |
|---|---|---|---|---|
| 1225 | 1 + 3 | 1 | 0 | 1 |
| 940 | 1 + 1 | 2 | 0 | 0 |
| 1894 | 3 + 3 | 0 | 0 | 2 |
| 1186 | 1 + 3 | 1 | 0 | 1 |
| 99 | 1 + 1 + 3 | 2 | 0 | 1 |
| 1189 | 3 + 2 | 0 | 1 | 1 |
| 971 | 1 + 1 | 2 | 0 | 0 |
| 344 | 1 + 1 + 3 + 3 | 2 | 0 | 2 |
| 373 | 1 | 1 | 0 | 0 |
| 1665 | 2 + 2 + 2 | 0 | 3 | 0 |
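OneHotMulti is defined in resources_geschichtslyrik and not shown here; the output above suggests it counts category occurrences per ' + '-joined value (e.g. '1 + 1' yields entity_simple_1 = 2). A minimal stand-in under that assumption:

```python
from collections import Counter

def one_hot_multi(values, prefix):
    # Count-based multi-hot encoding: split each ' + '-joined string and count
    # how often each category occurs. A sketch of what the output above
    # suggests OneHotMulti does; the real helper is not reproduced here.
    categories = sorted({c for v in values for c in v.split(' + ')})
    rows = []
    for v in values:
        counts = Counter(v.split(' + '))
        rows.append({f'{prefix}_{c}': counts.get(c, 0) for c in categories})
    return rows

rows = one_hot_multi(['1 + 1', '1 + 3', '2 + 2 + 2'], 'entity_simple')
print(rows[0]['entity_simple_1'], rows[2]['entity_simple_2'])  # 2 3
```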
In [99]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['interval'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
entity_bewertung [nominal_multi_dependent]¶
In [100]:
entity_ratings['type'] = entity_ratings['type'].replace({'1 ' : '1'})
entity_ratings['rating'] = entity_ratings['rating'].astype(int)
In [101]:
all_types = entity_ratings['type'].unique()
all_rating_types = entity_ratings['rating'].unique()
In [102]:
# Nominal approach
column_names = []
for this_type in all_types:
    for this_rating_type in all_rating_types:
        column_names.append('entity_bewertung_' + this_type + '_' + str(this_rating_type))
onehot_df = pd.DataFrame(index = meta_all.index, columns = column_names)
onehot_df = onehot_df.fillna(0)
for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
    this_title = element.title
    this_author = element.author
    this_ratings = entity_ratings.query("author == @this_author and title == @this_title")
    for rating in this_ratings.iloc:
        this_type = rating.type
        this_rating = rating.rating
        onehot_df.at[i, 'entity_bewertung_' + this_type + '_' + str(this_rating)] += 1
In [103]:
# Ordinal approach
# entity_ratings['rating'] = entity_ratings['rating'].replace({3 : 0, 2 : -1})
#
# column_names = []
# for this_type in all_types:
#     column_names.append('entity_bewertung_' + this_type)
#
# onehot_df = pd.DataFrame(columns = column_names)
#
# for i, element in tqdm(enumerate(meta_all.iloc), total = meta_all.shape[0]):
#     this_title = element.title
#     this_author = element.author
#     this_ratings = entity_ratings.query("author == @this_author and title == @this_title")
#
#     for this_type in this_ratings['type'].unique():
#         this_ratings_values = this_ratings.query("type == @this_type")['rating']
#         this_ratings_mean = this_ratings_values.mean()
#         onehot_df.at[i, 'entity_bewertung_' + this_type] = this_ratings_mean
In [104]:
onehot_df = onehot_df.fillna(0)
In [105]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'entity_simple',
'entity_bewertung',
'entity_bewertung_1_1',
'entity_bewertung_1_2',
'entity_bewertung_3_1',
]].sample(n=10)
Out[105]:
| entity_simple | entity_bewertung | entity_bewertung_1_1 | entity_bewertung_1_2 | entity_bewertung_3_1 | |
|---|---|---|---|---|---|
| 1873 | 4 | 0 | 0 | 0 | 0 |
| 1017 | 1 + 1 + 1 + 3 + 1 | 3 + 2 + 2 + 0 + 0 | 0 | 2 | 0 |
| 1157 | 2 + 3 | 1 + 2 | 0 | 0 | 0 |
| 26 | 4 + 3 | 3 + 0 | 0 | 0 | 0 |
| 504 | 3 + 3 + 1 + 1 | 3 + 1 + 1 + 1 | 2 | 0 | 1 |
| 1688 | 4 | 1 | 0 | 0 | 0 |
| 1622 | 1 + 1 + 3 | 1 + 3 + 1 | 1 | 0 | 1 |
| 775 | 1 + 1 + 3 | 0 + 2 + 2 | 0 | 1 | 0 |
| 482 | 1 + 3 + 1 | 1 + 2 + 0 | 1 | 0 | 0 |
| 1185 | 1 + 2 | 3 + 3 | 0 | 0 | 0 |
In [106]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi_dependent'] * len(column_names),
'encoding' : ['interval'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
marker_person: systematize (split into title/text)¶
In [107]:
meta_all['marker_person'].value_counts().sort_index()
Out[107]:
marker_person
/               411
Text            622
Titel           118
Titel + Text    912
Name: count, dtype: int64
In [108]:
meta_all['marker_person_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_person']]
meta_all['marker_person_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_person']]
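The substring checks above map each of the four observed annotation values to a pair of binary title/text features; a quick stand-alone check with those values:

```python
# The four observed marker annotations and their (title, text) indicator pairs.
values = ['/', 'Text', 'Titel', 'Titel + Text']
encoded = {v: (int('Titel' in v), int('Text' in v)) for v in values}
print(encoded['Titel + Text'])  # (1, 1)
print(encoded['/'])             # (0, 0): '/' marks the absence of a marker
```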
In [109]:
features_used_add = pd.DataFrame({
'feature' : ['marker_person_title', 'marker_person_text'],
'encoding_orig' : ['bin_multi', 'bin_multi'],
'encoding' : ['bin', 'bin'],
'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
marker_zeit: systematize (split into title/text)¶
In [110]:
meta_all['marker_zeit'].value_counts().sort_index()
Out[110]:
marker_zeit
/               1191
Text             773
Titel             48
Titel + Text      51
Name: count, dtype: int64
In [111]:
meta_all['marker_zeit_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_zeit']]
meta_all['marker_zeit_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_zeit']]
In [112]:
features_used_add = pd.DataFrame({
'feature' : ['marker_zeit_title', 'marker_zeit_text'],
'encoding_orig' : ['bin_multi', 'bin_multi'],
'encoding' : ['bin', 'bin'],
'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
marker_ort: systematize (split into title/text)¶
In [113]:
meta_all['marker_ort'].value_counts().sort_index()
Out[113]:
marker_ort
/               1506
Text             354
Titel             66
Titel + Text     137
Name: count, dtype: int64
In [114]:
meta_all['marker_ort_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_ort']]
meta_all['marker_ort_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_ort']]
In [115]:
features_used_add = pd.DataFrame({
'feature' : ['marker_ort_title', 'marker_ort_text'],
'encoding_orig' : ['bin_multi', 'bin_multi'],
'encoding' : ['bin', 'bin'],
'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
marker_objekt: systematize (split into title/text)¶
In [116]:
meta_all['marker_objekt'].value_counts().sort_index()
Out[116]:
marker_objekt
/                848
Text            1109
Titel             22
Titel + Text      84
Name: count, dtype: int64
In [117]:
meta_all['marker_objekt_title'] = [1 if 'Titel' in x else 0 for x in meta_all['marker_objekt']]
meta_all['marker_objekt_text'] = [1 if 'Text' in x else 0 for x in meta_all['marker_objekt']]
In [118]:
features_used_add = pd.DataFrame({
'feature' : ['marker_objekt_title', 'marker_objekt_text'],
'encoding_orig' : ['bin_multi', 'bin_multi'],
'encoding' : ['bin', 'bin'],
'weight' : [0.5, 0.5]
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
ueberlieferung_bewertung [nominal]: One-Hot-Encoding¶
In [119]:
meta_all['ueberlieferung_bewertung'] = [str(x) for x in meta_all['ueberlieferung_bewertung']]
data = meta_all['ueberlieferung_bewertung']
In [120]:
onehot_df = OneHotMulti(data, 'ueberlieferung_bewertung')
In [121]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'ueberlieferung_bewertung',
'ueberlieferung_bewertung_neutral',
'ueberlieferung_bewertung_positiv',
'ueberlieferung_bewertung_None',
]].sample(n=10)
Out[121]:
| ueberlieferung_bewertung | ueberlieferung_bewertung_neutral | ueberlieferung_bewertung_positiv | ueberlieferung_bewertung_None | |
|---|---|---|---|---|
| 1085 | None | 0 | 0 | 1 |
| 1051 | positiv | 0 | 1 | 0 |
| 945 | None | 0 | 0 | 1 |
| 1005 | None | 0 | 0 | 1 |
| 1477 | positiv | 0 | 1 | 0 |
| 1832 | None | 0 | 0 | 1 |
| 1624 | None | 0 | 0 | 1 |
| 1929 | None | 0 | 0 | 1 |
| 401 | None | 0 | 0 | 1 |
| 1064 | neutral | 1 | 0 | 0 |
In [122]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
geschichtsauffassung_bewertung [nominal]: One-Hot-Encoding¶
In [123]:
meta_all['geschichtsauffassung_bewertung'] = [str(x) for x in meta_all['geschichtsauffassung_bewertung']]
data = meta_all['geschichtsauffassung_bewertung']
In [124]:
onehot_df = OneHotMulti(data, 'geschichtsauffassung_bewertung')
In [125]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'geschichtsauffassung_bewertung',
'geschichtsauffassung_bewertung_positiv',
'geschichtsauffassung_bewertung_negativ',
'geschichtsauffassung_bewertung_None',
]].sample(n=10)
Out[125]:
| geschichtsauffassung_bewertung | geschichtsauffassung_bewertung_positiv | geschichtsauffassung_bewertung_negativ | geschichtsauffassung_bewertung_None | |
|---|---|---|---|---|
| 414 | None | 0 | 0 | 1 |
| 1310 | None | 0 | 0 | 1 |
| 1114 | None | 0 | 0 | 1 |
| 1196 | None | 0 | 0 | 1 |
| 752 | None | 0 | 0 | 1 |
| 1430 | None | 0 | 0 | 1 |
| 777 | None | 0 | 0 | 1 |
| 464 | None | 0 | 0 | 1 |
| 376 | None | 0 | 0 | 1 |
| 1812 | None | 0 | 0 | 1 |
In [126]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
verhaeltnis_wissen [nominal_multi]: One-Hot-Encoding¶
In [127]:
meta_all['verhaeltnis_wissen'] = [str(x) for x in meta_all['verhaeltnis_wissen']]
data = meta_all['verhaeltnis_wissen']
In [128]:
onehot_df = OneHotMulti(data, 'verhaeltnis_wissen')
In [129]:
if onehot_df.columns[0] not in meta_all.columns:
    meta_all = meta_all.join(onehot_df).copy()
meta_all[[
'verhaeltnis_wissen',
'verhaeltnis_wissen_übereinstimmend',
'verhaeltnis_wissen_ergänzend',
'verhaeltnis_wissen_abweichend_übernatürlich',
]].sample(n=10)
Out[129]:
| verhaeltnis_wissen | verhaeltnis_wissen_übereinstimmend | verhaeltnis_wissen_ergänzend | verhaeltnis_wissen_abweichend_übernatürlich | |
|---|---|---|---|---|
| 911 | ergänzend | 0 | 1 | 0 |
| 1359 | abweichend_übernatürlich | 0 | 0 | 1 |
| 1560 | abweichend_übernatürlich | 0 | 0 | 1 |
| 1774 | ergänzend | 0 | 1 | 0 |
| 798 | abweichend_übernatürlich | 0 | 0 | 1 |
| 485 | ergänzend | 0 | 1 | 0 |
| 1230 | ergänzend | 0 | 1 | 0 |
| 1551 | übereinstimmend | 1 | 0 | 0 |
| 1291 | abweichend_übernatürlich | 0 | 0 | 1 |
| 475 | ergänzend | 0 | 1 | 0 |
In [130]:
column_names = onehot_df.columns
features_used_add = pd.DataFrame({
'feature' : column_names,
'encoding_orig' : ['nominal_multi'] * len(column_names),
'encoding' : ['bin'] * len(column_names),
'weight' : [1/len(column_names)] * len(column_names)
})
features_used_df = pd.concat(
[features_used_df,
features_used_add
]).reset_index(drop = True)
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
Overview¶
In [131]:
features_used_df = features_used_df.drop_duplicates(subset = 'feature')
In [132]:
features_used = features_used_df['feature'].tolist()
In [133]:
features_used_df
Out[133]:
| feature | encoding | weight | encoding_orig | |
|---|---|---|---|---|
| 0 | geschichtslyrik | ordinal | 1.00 | NaN |
| 1 | empirisch | bin | 1.00 | NaN |
| 2 | theoretisch | bin | 1.00 | NaN |
| 3 | sprechinstanz_markiert | bin | 1.00 | NaN |
| 4 | konkretheit | ordinal | 1.00 | NaN |
| ... | ... | ... | ... | ... |
| 1137 | geschichtsauffassung_bewertung_ambivalent | bin | 0.20 | nominal |
| 1138 | verhaeltnis_wissen_ergänzend | bin | 0.25 | nominal_multi |
| 1139 | verhaeltnis_wissen_übereinstimmend | bin | 0.25 | nominal_multi |
| 1140 | verhaeltnis_wissen_abweichend_übernatürlich | bin | 0.25 | nominal_multi |
| 1141 | verhaeltnis_wissen_abweichend_natürlich | bin | 0.25 | nominal_multi |
1142 rows × 4 columns
In [134]:
meta_all.sample(n=10)[features_used]
Out[134]:
| geschichtslyrik | empirisch | theoretisch | sprechinstanz_markiert | konkretheit | vergangenheitsdominant | zeitebenen | fixierbarkeit | anachronismus | gegenwartsbezug | ... | ueberlieferung_bewertung_negativ | geschichtsauffassung_bewertung_None | geschichtsauffassung_bewertung_positiv | geschichtsauffassung_bewertung_neutral | geschichtsauffassung_bewertung_negativ | geschichtsauffassung_bewertung_ambivalent | verhaeltnis_wissen_ergänzend | verhaeltnis_wissen_übereinstimmend | verhaeltnis_wissen_abweichend_übernatürlich | verhaeltnis_wissen_abweichend_natürlich | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 206 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1543 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1282 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1412 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 333 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 921 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1654 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 131 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1592 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1256 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
10 rows × 1142 columns
Rename and export¶
In [135]:
features_used_export_df = features_used_df.copy()  # copy so the renaming below does not mutate features_used_df
features_used_export_df['feature'] = ['vectortyp_' + x for x in features_used_export_df['feature']]
features_used_export_df.to_csv("../resources/more/vectors/vectordist_features.csv")
In [136]:
features_used_export_df
Out[136]:
| feature | encoding | weight | encoding_orig | |
|---|---|---|---|---|
| 0 | vectortyp_geschichtslyrik | ordinal | 1.00 | NaN |
| 1 | vectortyp_empirisch | bin | 1.00 | NaN |
| 2 | vectortyp_theoretisch | bin | 1.00 | NaN |
| 3 | vectortyp_sprechinstanz_markiert | bin | 1.00 | NaN |
| 4 | vectortyp_konkretheit | ordinal | 1.00 | NaN |
| ... | ... | ... | ... | ... |
| 1137 | vectortyp_geschichtsauffassung_bewertung_ambivalent | bin | 0.20 | nominal |
| 1138 | vectortyp_verhaeltnis_wissen_ergänzend | bin | 0.25 | nominal_multi |
| 1139 | vectortyp_verhaeltnis_wissen_übereinstimmend | bin | 0.25 | nominal_multi |
| 1140 | vectortyp_verhaeltnis_wissen_abweichend_übernatürlich | bin | 0.25 | nominal_multi |
| 1141 | vectortyp_verhaeltnis_wissen_abweichend_natürlich | bin | 0.25 | nominal_multi |
1142 rows × 4 columns
In [137]:
export_meta = meta_all[['id'] + features_used]
export_meta.columns = ['vectortyp_' + x if x != 'id' else x for x in export_meta.columns]
export_meta.to_csv("../resources/more/vectors/vectordist.csv")
In [138]:
export_meta.head()
Out[138]:
| id | vectortyp_geschichtslyrik | vectortyp_empirisch | vectortyp_theoretisch | vectortyp_sprechinstanz_markiert | vectortyp_konkretheit | vectortyp_vergangenheitsdominant | vectortyp_zeitebenen | vectortyp_fixierbarkeit | vectortyp_anachronismus | ... | vectortyp_ueberlieferung_bewertung_negativ | vectortyp_geschichtsauffassung_bewertung_None | vectortyp_geschichtsauffassung_bewertung_positiv | vectortyp_geschichtsauffassung_bewertung_neutral | vectortyp_geschichtsauffassung_bewertung_negativ | vectortyp_geschichtsauffassung_bewertung_ambivalent | vectortyp_verhaeltnis_wissen_ergänzend | vectortyp_verhaeltnis_wissen_übereinstimmend | vectortyp_verhaeltnis_wissen_abweichend_übernatürlich | vectortyp_verhaeltnis_wissen_abweichend_natürlich | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1850.Grube.028 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 1.0 | 3.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1850.Kriebitzsch.001 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.5 | 2.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1850.Kriebitzsch.011 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1850.Kriebitzsch.019 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1851.Müller/Kletke.018 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 1.0 | 2.0 | 0.0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 1143 columns